Prepared by
Chris Bacani
Data Scientist Apprentice at IBM
Our objective is to identify factors that contribute to passenger satisfaction or dissatisfaction, so that the airline (client) can address them to enhance customer service and improve customer retention.
I decided to take on a data project regarding airline customer service data because of my passion for aviation. I have long been an avid enthusiast of aviation - and particularly commercial aviation. One of my primary goals as I grow in my role at IBM is to help airlines, cargo carriers, and airports solve different problems through innovative data-informed solutions.
With air travel beginning to resume after the grinding halt brought about by the COVID-19 pandemic, attention has returned to customer satisfaction - in most cases as a response to reduced or altered cabin service and the public health policies adopted by airlines. However, passenger satisfaction was shaky even before the pandemic. Airlines, particularly in the United States, have struggled to find a balance between service and value, often prioritizing one over the other.
As an apprentice on IBM's Data Science Elite team, one huge part of our curriculum is participating in projects to solve data science problems for our clients. This project is meant to mirror a typical end-to-end data science project that we would complete for a client: from data ingestion, to cleaning and analysis, to modeling, to deployment and explainability.
This use case is one that I look forward to taking on for an air travel client in the future: analyzing their customer service data and passing it through a machine learning model to identify and verify the factors that contribute to passenger satisfaction.
I found this dataset on Kaggle.com when searching for datasets that detailed passenger satisfaction. The data has been pre-split into train and test sets, and a target column has been engineered. So there won't be as much cleaning and manipulation required.
Many of the aspects that make up customer service data have been obfuscated to make this dataset more abstract. Information like airline carrier, origin airport, travel date, price, offers, and travel time has been made unavailable - all of which could have an impact on a customer satisfaction score.
import pandas as pd
import numpy as np
import scipy.stats as sp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from matplotlib import rcParams
# Font Size
rcParams['font.size'] = 12
# Figure Size
rcParams['figure.figsize'] = 7, 5
air_train_df = pd.read_csv('./Data/air-train.csv')
air_train_df.head()
| | Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
| 1 | 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
| 2 | 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
| 3 | 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
| 4 | 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |
5 rows × 25 columns
air_train_df.satisfaction.value_counts()
neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64
air_test_df = pd.read_csv('./Data/air-test.csv')
air_test_df.head()
| | Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19556 | Female | Loyal Customer | 52 | Business travel | Eco | 160 | 5 | 4 | ... | 5 | 5 | 5 | 5 | 2 | 5 | 5 | 50 | 44.0 | satisfied |
| 1 | 1 | 90035 | Female | Loyal Customer | 36 | Business travel | Business | 2863 | 1 | 1 | ... | 4 | 4 | 4 | 4 | 3 | 4 | 5 | 0 | 0.0 | satisfied |
| 2 | 2 | 12360 | Male | disloyal Customer | 20 | Business travel | Eco | 192 | 2 | 0 | ... | 2 | 4 | 1 | 3 | 2 | 2 | 2 | 0 | 0.0 | neutral or dissatisfied |
| 3 | 3 | 77959 | Male | Loyal Customer | 44 | Business travel | Business | 3377 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 3 | 1 | 4 | 0 | 6.0 | satisfied |
| 4 | 4 | 36875 | Female | Loyal Customer | 49 | Business travel | Eco | 1182 | 2 | 3 | ... | 2 | 2 | 2 | 2 | 4 | 2 | 4 | 0 | 20.0 | satisfied |
5 rows × 25 columns
# Training dataset
air_train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   Unnamed: 0                         103904 non-null  int64
 1   id                                 103904 non-null  int64
 2   Gender                             103904 non-null  object
 3   Customer Type                      103904 non-null  object
 4   Age                                103904 non-null  int64
 5   Type of Travel                     103904 non-null  object
 6   Class                              103904 non-null  object
 7   Flight Distance                    103904 non-null  int64
 8   Inflight wifi service              103904 non-null  int64
 9   Departure/Arrival time convenient  103904 non-null  int64
 10  Ease of Online booking             103904 non-null  int64
 11  Gate location                      103904 non-null  int64
 12  Food and drink                     103904 non-null  int64
 13  Online boarding                    103904 non-null  int64
 14  Seat comfort                       103904 non-null  int64
 15  Inflight entertainment             103904 non-null  int64
 16  On-board service                   103904 non-null  int64
 17  Leg room service                   103904 non-null  int64
 18  Baggage handling                   103904 non-null  int64
 19  Checkin service                    103904 non-null  int64
 20  Inflight service                   103904 non-null  int64
 21  Cleanliness                        103904 non-null  int64
 22  Departure Delay in Minutes         103904 non-null  int64
 23  Arrival Delay in Minutes           103594 non-null  float64
 24  satisfaction                       103904 non-null  object
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB
# Test dataset
air_test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Unnamed: 0                         25976 non-null  int64
 1   id                                 25976 non-null  int64
 2   Gender                             25976 non-null  object
 3   Customer Type                      25976 non-null  object
 4   Age                                25976 non-null  int64
 5   Type of Travel                     25976 non-null  object
 6   Class                              25976 non-null  object
 7   Flight Distance                    25976 non-null  int64
 8   Inflight wifi service              25976 non-null  int64
 9   Departure/Arrival time convenient  25976 non-null  int64
 10  Ease of Online booking             25976 non-null  int64
 11  Gate location                      25976 non-null  int64
 12  Food and drink                     25976 non-null  int64
 13  Online boarding                    25976 non-null  int64
 14  Seat comfort                       25976 non-null  int64
 15  Inflight entertainment             25976 non-null  int64
 16  On-board service                   25976 non-null  int64
 17  Leg room service                   25976 non-null  int64
 18  Baggage handling                   25976 non-null  int64
 19  Checkin service                    25976 non-null  int64
 20  Inflight service                   25976 non-null  int64
 21  Cleanliness                        25976 non-null  int64
 22  Departure Delay in Minutes         25976 non-null  int64
 23  Arrival Delay in Minutes           25893 non-null  float64
 24  satisfaction                       25976 non-null  object
dtypes: float64(1), int64(19), object(5)
memory usage: 5.0+ MB
# Checking values of scored columns, using cleanliness as score to test
air_train_df.Cleanliness.value_counts()
4    27179
3    24574
5    22689
2    16132
1    13318
0       12
Name: Cleanliness, dtype: int64
The columns Unnamed: 0 and id are of no use to us, so they can be dropped right away. For the sake of efficiency, we'll apply this dataframe manipulation to both the training and testing datasets.
Beyond the 5 categorical variables that need to be converted before model ingestion, there is a data type mismatch between the delay columns: the arrival column is a float while the departure column is an integer. Whether strictly necessary or not, I will cast both to floats for consistency when performing further analyses.
In the training dataset, there are 310 missing values in the column detailing the on-time performance of a flight (Arrival Delay in Minutes); the test set has 83 missing values in the same column.
Since we cannot safely assume that a null value means the trip was not delayed, we will fill these with the mean value of the column.
Examining the dataframe's first 5 rows, I can see rows with a value of 0 in columns that should be rated on a scale from 1 to 5. For simplicity's sake, I will drop rows where a 0 value exists in any of these ranked columns. This should reduce noise in our data set and also make it easier for the model to ingest later on.
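Before dropping, it can help to count how many rows are actually affected. A minimal sketch on a toy frame (the two column names here are illustrative stand-ins for the full list of survey columns):

```python
import pandas as pd

# Toy frame standing in for the survey columns (rated 1-5; 0 = invalid)
df = pd.DataFrame({
    "seat_comfort": [3, 0, 5, 4],
    "cleanliness":  [4, 2, 0, 5],
})
rank_cols = ["seat_comfort", "cleanliness"]

# Rows where any ranked column holds a 0
zero_mask = (df[rank_cols] == 0).any(axis=1)
print(zero_mask.sum())          # number of rows that would be dropped
df_clean = df.loc[~zero_mask]   # keep only fully-rated rows
```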
def clean_data(orig_df):
    '''
    This function applies 7 steps to the dataframe to clean the data.
    1. Drop unnecessary columns.
    2. Unify datatypes in the delay columns.
    3. Normalize column names.
    4. Normalize text values in columns.
    5. Impute numeric null values with the mean value of the column.
    6. Drop "zero" values from ranked categorical variables.
    7. Create aggregated flight delay columns.
    Return: Cleaned DataFrame, ready for analysis - final encoding still to be applied.
    '''
    df = orig_df.copy()

    # 1. Dropping unnecessary columns
    df.drop(['Unnamed: 0', 'id'], axis = 1, inplace = True)

    # 2. Unifying datatypes in the delay columns
    df['Departure Delay in Minutes'] = df['Departure Delay in Minutes'].astype(float)

    # 3. Normalizing column names. Replacing spaces and other characters with
    # underscores makes the columns easier to work with and callable via dot notation.
    df.columns = df.columns.str.lower()
    for special_char in "/ -":
        df.columns = [col.replace(special_char, '_') for col in df.columns]

    # 4. Normalizing text values in columns
    cat_cols = ['gender', 'customer_type', 'class', 'type_of_travel', 'satisfaction']
    for column in cat_cols:
        df[column] = df[column].str.lower()

    # 5. Imputing the nulls in the arrival delay column with the mean. Since we
    # cannot safely equate these nulls to a zero value, the mean value of the
    # column is the most sensible method of replacement.
    df['arrival_delay_in_minutes'].fillna(df['arrival_delay_in_minutes'].mean(), inplace = True)
    df = df.round({'arrival_delay_in_minutes': 1})

    # 6. Dropping rows from ranked value columns where "zero" exists as a value.
    # These columns are rated on a scale from 1 to 5, so a zero carries no
    # usable information.
    rank_list = ["inflight_wifi_service", "departure_arrival_time_convenient",
                 "ease_of_online_booking", "gate_location", "food_and_drink",
                 "online_boarding", "seat_comfort", "inflight_entertainment",
                 "on_board_service", "leg_room_service", "baggage_handling",
                 "checkin_service", "inflight_service", "cleanliness"]
    for col in rank_list:
        df.drop(df.loc[df[col] == 0].index, inplace = True)

    # 7. Creating aggregated and categorical flight delay columns
    df['total_delay_time'] = df['departure_delay_in_minutes'] + df['arrival_delay_in_minutes']
    df['was_flight_delayed'] = np.where(df['total_delay_time'] > 0, 'yes', 'no')

    return df
air_train_cleaned = clean_data(air_train_df)
air_train_cleaned.head()
| | gender | customer_type | age | type_of_travel | class | flight_distance | inflight_wifi_service | departure_arrival_time_convenient | ease_of_online_booking | gate_location | ... | leg_room_service | baggage_handling | checkin_service | inflight_service | cleanliness | departure_delay_in_minutes | arrival_delay_in_minutes | satisfaction | total_delay_time | was_flight_delayed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | loyal customer | 13 | personal travel | eco plus | 460 | 3 | 4 | 3 | 1 | ... | 3 | 4 | 4 | 5 | 5 | 25.0 | 18.0 | neutral or dissatisfied | 43.0 | yes |
| 1 | male | disloyal customer | 25 | business travel | business | 235 | 3 | 2 | 3 | 3 | ... | 5 | 3 | 1 | 4 | 1 | 1.0 | 6.0 | neutral or dissatisfied | 7.0 | yes |
| 2 | female | loyal customer | 26 | business travel | business | 1142 | 2 | 2 | 2 | 2 | ... | 3 | 4 | 4 | 4 | 5 | 0.0 | 0.0 | satisfied | 0.0 | no |
| 3 | female | loyal customer | 25 | business travel | business | 562 | 2 | 5 | 5 | 5 | ... | 5 | 3 | 1 | 4 | 2 | 11.0 | 9.0 | neutral or dissatisfied | 20.0 | yes |
| 4 | male | loyal customer | 61 | business travel | business | 214 | 3 | 3 | 3 | 3 | ... | 4 | 4 | 3 | 3 | 3 | 0.0 | 0.0 | satisfied | 0.0 | no |
5 rows × 25 columns
air_test_cleaned = clean_data(air_test_df)
Here I am checking for imbalance in the outcomes of our prediction column. This will inform whether additional steps (resampling, downsampling, etc.) are needed before moving on.
fig = plt.figure(figsize = (10,7))
air_train_cleaned.satisfaction.value_counts(normalize = True).plot(kind='bar', alpha = 0.9, rot=0)
plt.title('Customer satisfaction')
plt.ylabel('Percent')
plt.show()
Observations : The prediction classes are not perfectly even - the ratio of dissatisfied to satisfied customers is roughly 55 to 45 - but I would not say that the data presents itself as imbalanced.
I can say that the data does not require additional treatment or resampling.
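The same check can be made numerically rather than visually; a minimal sketch with a stand-in series (in the notebook this would be called on air_train_cleaned.satisfaction):

```python
import pandas as pd

# Stand-in series mirroring the roughly 55/45 split seen in the data
satisfaction = pd.Series(
    ["neutral or dissatisfied"] * 55 + ["satisfied"] * 45
)

# Normalized counts give the class proportions directly
ratio = satisfaction.value_counts(normalize=True)
print(ratio.round(2))
```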
What do the categorical variables look like? What is the spread of responses across the dataset?
from matplotlib.pyplot import figure
categoricals = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking",
"gate_location", "food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment",
"on_board_service", "leg_room_service", "baggage_handling", "checkin_service",
"inflight_service", "cleanliness"]
air_train_cleaned.hist(column = categoricals, layout=(4,4), label='x', figsize = (20,20));
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "gender", hue = 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple = "dodge", palette = 'Set1')

with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "customer_type", hue = 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple = "dodge", palette = 'Set1')
Observations : The gender groups are about equally likely to be satisfied or dissatisfied; the slight skew simply reflects the proportions of the target variable.
Loyal customers dramatically outnumber non-loyal customers. However, among both groups, dissatisfied customers make up the majority class.
Because the split between positive and negative outcomes is essentially the same within each of these groups, neither variable appears to separate the target classes on its own.
While I'm not discounting these variables outright as far as feature importance is concerned, I don't believe that there are any insights to be gained from further analysis of these two variables.
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "class", hue = 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple = "dodge", palette = 'Set1')
Observations : Passengers who travel in a business class cabin are far more likely to finish their trip with a positive impression than are those who fly in economy or economy plus/premium economy.
The dramatic difference in satisfaction between fliers in a business class cabin versus an economy type cabin suggests that there may be characteristic differences here informing these distributions. For that reason I'd like to examine this column further.
with sns.axes_style(style = 'ticks'):
    d = sns.histplot(x = "type_of_travel", hue = 'satisfaction', data = air_train_cleaned,
                     stat = 'percent', multiple = "dodge", palette = 'Set1')
Observations : Passengers travelling for business seem far more likely to be satisfied with the experience of their trip than those who travel for personal reasons.
The two groups show dramatic and inverse proportions of the positive and negative outcomes of the target variable - satisfied customers vs dissatisfied or neutral.
This is another column where I think we can find great insights from further analyses.
# Countplot comparing age to satisfaction
with sns.axes_style('white'):
    g = sns.catplot(x = 'age', data = air_train_cleaned,
                    kind = 'count', hue = 'satisfaction', order = range(7, 80),
                    height = 8.27, aspect = 18.7/8.27, legend = False,
                    palette = 'Set1')
plt.legend(loc='upper right');
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "age", palette='Set1')
<AxesSubplot:xlabel='satisfaction', ylabel='age'>
Observations : Customers across most age groups are generally more dissatisfied/neutral regarding their trip experience. However, customers from roughly age 38 to 61 are generally satisfied with their trip experience.
As with the previous two columns, the discrepancy points to potential insights to be unearthed by conducting further analyses.
sns.violinplot(data = air_train_cleaned, x = "class", y = "age",
hue = 'satisfaction', palette = 'Set1', split = True)
<AxesSubplot:xlabel='class', ylabel='age'>
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "flight_distance", palette = 'Set1')
<AxesSubplot:xlabel='satisfaction', ylabel='flight_distance'>
sns.violinplot(data = air_train_cleaned, x = "class", y = "flight_distance",
hue = 'satisfaction', palette = 'Set1', split = True)
<AxesSubplot:xlabel='class', ylabel='flight_distance'>
fig, axes = plt.subplots(1, 2)
sns.violinplot(data = air_train_cleaned, x = "satisfaction",
y = "departure_delay_in_minutes", ax = axes[0], palette = 'Set1')
sns.violinplot(data = air_train_cleaned, x = "satisfaction",
y = "arrival_delay_in_minutes", ax = axes[1], palette = 'Set1')
<AxesSubplot:xlabel='satisfaction', ylabel='arrival_delay_in_minutes'>
# Note: split=True is only valid with a two-level hue, so it is omitted here
sns.violinplot(data = air_train_cleaned, x = "satisfaction", y = "total_delay_time", palette = 'Set1')
<AxesSubplot:xlabel='satisfaction', ylabel='total_delay_time'>
Observations : With the majority of flights having no significant delays, I don't believe delay is a major factor in determining passenger satisfaction.
air_train_cleaned.was_flight_delayed.value_counts()
yes    52436
no     43268
Name: was_flight_delayed, dtype: int64
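To back up that reading with numbers, satisfaction rates can be compared between delayed and non-delayed flights. A hedged sketch on a toy frame (in the notebook this would be run on air_train_cleaned):

```python
import pandas as pd

# Toy frame standing in for the cleaned training data
df = pd.DataFrame({
    "was_flight_delayed": ["yes", "no", "yes", "no", "yes", "no"],
    "satisfaction": ["satisfied", "satisfied", "neutral or dissatisfied",
                     "satisfied", "neutral or dissatisfied",
                     "neutral or dissatisfied"],
})

# Share of each satisfaction outcome within each delay group;
# similar rows would suggest delay is not a strong driver.
rates = pd.crosstab(df["was_flight_delayed"], df["satisfaction"], normalize="index")
print(rates)
```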
score_cols = ["inflight_wifi_service", "departure_arrival_time_convenient", "ease_of_online_booking",
"gate_location","food_and_drink", "online_boarding", "seat_comfort", "inflight_entertainment",
"on_board_service","leg_room_service", "baggage_handling", "checkin_service", "inflight_service",
"cleanliness"]
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Plot the scored column on the new axis
    sns.violinplot(data = air_train_cleaned,
                   x = 'satisfaction',
                   y = score_col,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.set_xlabel("")
Observation : Low ratings for in-flight wi-fi and online booking are strongly associated with passenger dissatisfaction, while dissatisfaction shows up more at middle-to-high ratings for gate location, baggage handling, check-in, and in-flight customer service.
High marks in nearly all categories show a clear association with customer satisfaction.
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Plot the scored column on the new axis
    sns.violinplot(data = air_train_cleaned,
                   x = score_col,
                   y = 'age',
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
              fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")
Observations : For many of these customer service categories, the individual distributions of satisfaction vs dissatisfaction across each rating for customer service directly mirror the bivariate distributions of age vs satisfaction. There is a noticeable peak in the 37-60 age group across most of these distributions.
Certain customer service variables (in-flight customer service, baggage handling, leg room, on-board service quality, and inflight entertainment) have passengers measured as satisfied despite scoring the service column poorly. In the leg room category, a large number of younger travelers are dissatisfied despite scoring the column highly.
There are interesting distributions in the online boarding category - where the 40-60 age group is measured as satisfied despite giving the service column the lowest possible score, and dissatisfied despite giving the service column the highest possible score.
I don't see any trends that explicitly establish age as a factor with an impact on customer satisfaction.
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Plot the scored column on the new axis
    sns.violinplot(data = air_train_cleaned,
                   x = score_col,
                   y = 'flight_distance',
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
              fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")
Observations : While many of these features show equal distributions of satisfied and dissatisfied customers across all flight distances - there are a few points suggesting certain features have more of an impact on satisfaction.
There is a higher rate of satisfaction for customers who provided the highest mark for in-flight wi-fi service and online booking ease.
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Plot the scored column on the new axis
    sns.violinplot(data = air_train_cleaned,
                   x = 'class',
                   y = score_col,
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
              fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")
Observations : These discrete distributions tell me that cabin class is going to be a large determinant of passenger satisfaction. There are clear peaks in satisfaction and dissatisfaction for in-flight wi-fi service, online boarding, seat comfort, in-flight entertainment, on-board customer service, leg room, and in-flight customer service.
Online boarding and cabin cleanliness appear to be important factors in driving satisfaction regardless of the cabin class in which the passenger is traveling.
On-board service quality, leg room, baggage handling, check-in service, and inflight customer service appear to be important factors informing passenger satisfaction for business class passengers.
An interesting plot is the in-flight wi-fi service column. High ratings appear to have a high impact driving satisfaction for customers flying in economy plus and economy classes. It doesn't appear to have as much of an impact on satisfaction for business class travelers, but mediocre or low ratings appear to have a high impact on dissatisfaction for travelers in this cabin.
plt.figure(figsize=(40, 20))
plt.subplots_adjust(hspace=0.3)
# Loop through scored columns
for n, score_col in enumerate(score_cols):
    # Add a new subplot iteratively
    ax = plt.subplot(4, 4, n + 1)
    # Plot the scored column on the new axis
    sns.violinplot(data = air_train_cleaned,
                   x = 'type_of_travel',
                   y = score_col,
                   hue = "satisfaction",
                   split = True,
                   ax = ax,
                   palette = 'Set1')
    # Chart formatting
    ax.set_title(score_col)
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05),
              fancybox=True, shadow=True, ncol=5)
    ax.set_xlabel("")
Observations : Split by purpose of travel, the distributions across these service columns appear highly informative for customer satisfaction.
Similar to the trend we observed for cabin class, high ratings for the boarding process have a strong impact on satisfaction. Inflight wi-fi service also influences satisfaction, though with an almost inverse pattern by purpose of travel: for personal travelers, high ratings drive satisfaction, while business travelers appear satisfied almost regardless of rating. Low wi-fi scores from business travelers, however, do drive dissatisfaction.
We can also observe the same drivers of satisfaction for business travelers (business purpose and business cabin): inflight customer service, cleanliness, check-in service, baggage handling, legroom, and on-board service quality.
air_train_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95704 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   gender                             95704 non-null  object
 1   customer_type                      95704 non-null  object
 2   age                                95704 non-null  int64
 3   type_of_travel                     95704 non-null  object
 4   class                              95704 non-null  object
 5   flight_distance                    95704 non-null  int64
 6   inflight_wifi_service              95704 non-null  int64
 7   departure_arrival_time_convenient  95704 non-null  int64
 8   ease_of_online_booking             95704 non-null  int64
 9   gate_location                      95704 non-null  int64
 10  food_and_drink                     95704 non-null  int64
 11  online_boarding                    95704 non-null  int64
 12  seat_comfort                       95704 non-null  int64
 13  inflight_entertainment             95704 non-null  int64
 14  on_board_service                   95704 non-null  int64
 15  leg_room_service                   95704 non-null  int64
 16  baggage_handling                   95704 non-null  int64
 17  checkin_service                    95704 non-null  int64
 18  inflight_service                   95704 non-null  int64
 19  cleanliness                        95704 non-null  int64
 20  departure_delay_in_minutes         95704 non-null  float64
 21  arrival_delay_in_minutes           95704 non-null  float64
 22  satisfaction                       95704 non-null  object
 23  total_delay_time                   95704 non-null  float64
 24  was_flight_delayed                 95704 non-null  object
dtypes: float64(3), int64(16), object(6)
memory usage: 19.0+ MB
air_train_cleaned.head()
| | gender | customer_type | age | type_of_travel | class | flight_distance | inflight_wifi_service | departure_arrival_time_convenient | ease_of_online_booking | gate_location | ... | leg_room_service | baggage_handling | checkin_service | inflight_service | cleanliness | departure_delay_in_minutes | arrival_delay_in_minutes | satisfaction | total_delay_time | was_flight_delayed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | loyal customer | 13 | personal travel | eco plus | 460 | 3 | 4 | 3 | 1 | ... | 3 | 4 | 4 | 5 | 5 | 25.0 | 18.0 | neutral or dissatisfied | 43.0 | yes |
| 1 | male | disloyal customer | 25 | business travel | business | 235 | 3 | 2 | 3 | 3 | ... | 5 | 3 | 1 | 4 | 1 | 1.0 | 6.0 | neutral or dissatisfied | 7.0 | yes |
| 2 | female | loyal customer | 26 | business travel | business | 1142 | 2 | 2 | 2 | 2 | ... | 3 | 4 | 4 | 4 | 5 | 0.0 | 0.0 | satisfied | 0.0 | no |
| 3 | female | loyal customer | 25 | business travel | business | 562 | 2 | 5 | 5 | 5 | ... | 5 | 3 | 1 | 4 | 2 | 11.0 | 9.0 | neutral or dissatisfied | 20.0 | yes |
| 4 | male | loyal customer | 61 | business travel | business | 214 | 3 | 3 | 3 | 3 | ... | 4 | 4 | 3 | 3 | 3 | 0.0 | 0.0 | satisfied | 0.0 | no |
5 rows × 25 columns
from sklearn.preprocessing import OrdinalEncoder

def encode_data(orig_df):
    '''
    Encodes the remaining categorical variables so the dataframe is ready for model ingestion.
    Inputs: DataFrame
    Manipulations: ordinal encoding of the scored rating columns, replacement of
    binary categories, one-hot encoding of the cabin class column.
    Return: encoded DataFrame
    '''
    df = orig_df.copy()

    # Ordinal encoding of the scored rating columns (score_cols is defined above)
    encoder = OrdinalEncoder()
    for j in score_cols:
        df[j] = encoder.fit_transform(df[[j]])

    # Replacement of binary categories
    df.was_flight_delayed.replace({'no': 0, 'yes': 1}, inplace = True)
    df['satisfaction'].replace({'neutral or dissatisfied': 0, 'satisfied': 1}, inplace = True)
    df.customer_type.replace({'disloyal customer': 0, 'loyal customer': 1}, inplace = True)
    df.type_of_travel.replace({'personal travel': 0, 'business travel': 1}, inplace = True)
    df.gender.replace({'male': 0, 'female': 1}, inplace = True)

    # One-hot encoding of the cabin class column
    encoded_df = pd.get_dummies(df, columns = ['class'])
    return encoded_df
# Applying encoding to training dataset
air_train_encoded = encode_data(air_train_cleaned)
air_train_encoded.cleanliness.value_counts()
3.0    25294
2.0    22634
4.0    20928
1.0    14727
0.0    12121
Name: cleanliness, dtype: int64
air_train_encoded.head()
| | gender | customer_type | age | type_of_travel | flight_distance | inflight_wifi_service | departure_arrival_time_convenient | ease_of_online_booking | gate_location | food_and_drink | ... | inflight_service | cleanliness | departure_delay_in_minutes | arrival_delay_in_minutes | satisfaction | total_delay_time | was_flight_delayed | class_business | class_eco | class_eco plus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 13 | 0 | 460 | 2.0 | 3.0 | 2.0 | 0.0 | 4.0 | ... | 4.0 | 4.0 | 25.0 | 18.0 | 0 | 43.0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 25 | 1 | 235 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | ... | 3.0 | 0.0 | 1.0 | 6.0 | 0 | 7.0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 26 | 1 | 1142 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | ... | 3.0 | 4.0 | 0.0 | 0.0 | 1 | 0.0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 25 | 1 | 562 | 1.0 | 4.0 | 4.0 | 4.0 | 1.0 | ... | 3.0 | 1.0 | 11.0 | 9.0 | 0 | 20.0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 61 | 1 | 214 | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | ... | 2.0 | 2.0 | 0.0 | 0.0 | 1 | 0.0 | 0 | 1 | 0 | 0 |
5 rows × 27 columns
air_train_encoded.satisfaction.value_counts()
0    54947
1    40757
Name: satisfaction, dtype: int64
type(air_train_encoded)
pandas.core.frame.DataFrame
air_train_encoded.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95704 entries, 0 to 103903
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   gender                             95704 non-null  int64
 1   customer_type                      95704 non-null  int64
 2   age                                95704 non-null  int64
 3   type_of_travel                     95704 non-null  int64
 4   flight_distance                    95704 non-null  int64
 5   inflight_wifi_service              95704 non-null  float64
 6   departure_arrival_time_convenient  95704 non-null  float64
 7   ease_of_online_booking             95704 non-null  float64
 8   gate_location                      95704 non-null  float64
 9   food_and_drink                     95704 non-null  float64
 10  online_boarding                    95704 non-null  float64
 11  seat_comfort                       95704 non-null  float64
 12  inflight_entertainment             95704 non-null  float64
 13  on_board_service                   95704 non-null  float64
 14  leg_room_service                   95704 non-null  float64
 15  baggage_handling                   95704 non-null  float64
 16  checkin_service                    95704 non-null  float64
 17  inflight_service                   95704 non-null  float64
 18  cleanliness                        95704 non-null  float64
 19  departure_delay_in_minutes         95704 non-null  float64
 20  arrival_delay_in_minutes           95704 non-null  float64
 21  satisfaction                       95704 non-null  int64
 22  total_delay_time                   95704 non-null  float64
 23  was_flight_delayed                 95704 non-null  int64
 24  class_business                     95704 non-null  uint8
 25  class_eco                          95704 non-null  uint8
 26  class_eco plus                     95704 non-null  uint8
dtypes: float64(17), int64(7), uint8(3)
memory usage: 18.5 MB
# Applying encoding to test dataset
air_test_encoded = encode_data(air_test_cleaned)
train_corr = air_train_encoded.corr()[['satisfaction']]
plt.figure(figsize=(10, 12))
heatmap = sns.heatmap(train_corr.sort_values(by='satisfaction', ascending=False),
vmin=-1, vmax=1, annot=True, cmap='Blues')
heatmap.set_title('Feature Correlation with Target Variable', fontdict={'fontsize':14});
# Pre-processing and scaling dataset for feature selection
from sklearn import preprocessing
r_scaler = preprocessing.MinMaxScaler()
r_scaler.fit(air_train_encoded)
air_train_scaled = pd.DataFrame(r_scaler.transform(air_train_encoded), columns = air_train_encoded.columns)
air_train_scaled.head()
| gender | customer_type | age | type_of_travel | flight_distance | inflight_wifi_service | departure_arrival_time_convenient | ease_of_online_booking | gate_location | food_and_drink | ... | inflight_service | cleanliness | departure_delay_in_minutes | arrival_delay_in_minutes | satisfaction | total_delay_time | was_flight_delayed | class_business | class_eco | class_eco plus | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.076923 | 0.0 | 0.086632 | 0.50 | 0.75 | 0.50 | 0.00 | 1.00 | ... | 1.00 | 1.00 | 0.015704 | 0.011364 | 0.0 | 0.013539 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.230769 | 1.0 | 0.041195 | 0.50 | 0.25 | 0.50 | 0.50 | 0.00 | ... | 0.75 | 0.00 | 0.000628 | 0.003788 | 0.0 | 0.002204 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 | 0.243590 | 1.0 | 0.224354 | 0.25 | 0.25 | 0.25 | 0.25 | 1.00 | ... | 0.75 | 1.00 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 | 0.230769 | 1.0 | 0.107229 | 0.25 | 1.00 | 1.00 | 1.00 | 0.25 | ... | 0.75 | 0.25 | 0.006910 | 0.005682 | 0.0 | 0.006297 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.692308 | 1.0 | 0.036955 | 0.50 | 0.50 | 0.50 | 0.50 | 0.75 | ... | 0.50 | 0.50 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 27 columns
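The scaling above is worth a sanity check: MinMaxScaler maps each column to [0, 1] via (x − col_min) / (col_max − col_min). A minimal sketch on toy data (not the airline dataset) confirms the formula:

```python
# Minimal sketch (toy data): MinMaxScaler applies
# (x - col_min) / (col_max - col_min) to each column independently.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Manual computation matches the transformer's output
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(X_scaled, manual))  # True
```

Note that a 0/1 dummy column such as `class_business` passes through unchanged, since its min and max are already 0 and 1.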
# Feature selection, applying SelectKBest with chi2 to output the 10 most important features
from sklearn.feature_selection import SelectKBest, chi2
X = air_train_scaled.loc[:,air_train_scaled.columns!='satisfaction']
y = air_train_scaled[['satisfaction']]
selector = SelectKBest(chi2, k = 10)
selector.fit(X, y)
X_new = selector.transform(X)
features = (X.columns[selector.get_support(indices=True)])
features
Index(['type_of_travel', 'inflight_wifi_service', 'online_boarding',
'seat_comfort', 'inflight_entertainment', 'on_board_service',
'leg_room_service', 'cleanliness', 'class_business', 'class_eco'],
dtype='object')
selector.pvalues_
array([6.69798296e-003, 6.51597103e-161, 2.74981007e-046, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 2.62274871e-016, 2.67143169e-249,
5.00768922e-001, 2.65703013e-214, 0.00000000e+000, 0.00000000e+000,
0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 1.17553644e-202,
4.62662048e-217, 7.43319643e-194, 0.00000000e+000, 9.00744872e-005,
8.65788663e-006, 2.87418958e-005, 1.36828268e-045, 0.00000000e+000,
0.00000000e+000, 4.45255856e-231])
Observations : With the chi-square p-value as the selection criterion, many of the features picked are the different scored aspects of the customer experience, in addition to the reason for travel and the cabin class in which the passenger is travelling.
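The flat `pvalues_` array above is easier to read when paired with the column names. A sketch of that pattern on synthetic data (the column names here are illustrative, not from the airline dataset):

```python
# Sketch (synthetic data): pair SelectKBest chi2 p-values with column names
# so the raw pvalues_ array becomes readable. Column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.integers(0, 5, 500),
    "noise_a": rng.integers(0, 5, 500),
    "noise_b": rng.integers(0, 5, 500),
})
# Target depends on the first column only
y = (X["informative"] >= 3).astype(int)

selector = SelectKBest(chi2, k=1).fit(X, y)
pvals = pd.Series(selector.pvalues_, index=X.columns).sort_values()
print(pvals)
```

Sorting the named Series makes the ranking immediate, whereas the bare array forces the reader to count positions against the column list.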
import sklearn
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import CategoricalNB
import xgboost
from xgboost import XGBClassifier
# Features as selected from feature importance above
# Specifying target variable
target = ['satisfaction']
# Splitting into train and test
X_train = air_train_encoded[features].to_numpy()
X_test = air_test_encoded[features]
y_train = air_train_encoded[target].to_numpy()
y_test = air_test_encoded[target]
X_test.shape
(23863, 10)
# Time scores and metrics imports
import time
from resource import getrusage, RUSAGE_SELF
from sklearn.metrics import accuracy_score, roc_auc_score, plot_confusion_matrix, plot_roc_curve, precision_score, recall_score
# Model activation and result plot function
def get_model_metrics(model, X_train, X_test, y_train, y_test):
    '''
    Model activation function: fits the given model, runs predictions on the
    test data, and reports metrics.
    Inputs:
        model, X_train, X_test, y_train, y_test
    Output:
        Fitted model and output metrics; plots the confusion matrix and ROC curve
    '''
    # Mark the time when the model began running
    t0 = time.time()
    # Fit the model on the training data and run predictions on the test data
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    # Obtain training accuracy as a comparative metric
    train_score = model.score(X_train, y_train)
    # Obtain test accuracy, precision, recall, and ROC AUC using sklearn's metrics package
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred_proba)
    # Obtain the time taken to run the model, by subtracting the start time from the current time
    time_taken = time.time() - t0
    # Obtain the peak memory consumed so far (note: ru_maxrss is reported in
    # kilobytes on Linux and bytes on macOS, so this figure is platform-dependent)
    memory_used = int(getrusage(RUSAGE_SELF).ru_maxrss / 1024)
    # Output the metrics of the model performance
    print("Accuracy on Training = {}".format(train_score))
    print("Accuracy on Test = {} • Precision = {}".format(accuracy, precision))
    print("Recall = {} • ROC Area under Curve = {}".format(recall, roc))
    print("Time taken = {} seconds • Memory consumed = {} Bytes".format(time_taken, memory_used))
    # Plot the normalized confusion matrix of the model's predictions
    plot_confusion_matrix(model, X_test, y_test, cmap=plt.cm.Blues, normalize='all')
    # Plot the ROC curve of the model
    plot_roc_curve(model, X_test, y_test)
    plt.show()
    return model, train_score, accuracy, precision, recall, roc, time_taken, memory_used
LogisticRegression().get_params()
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'l1_ratio': None,
'max_iter': 100,
'multi_class': 'auto',
'n_jobs': None,
'penalty': 'l2',
'random_state': None,
'solver': 'lbfgs',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
%%time
clf = LogisticRegression()
params = {'n_jobs': [0, 5, 10, 15, 20]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best parameters: {'n_jobs': 5}
CPU times: user 25.5 s, sys: 1.64 s, total: 27.1 s
Wall time: 23.9 s
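One note on the search above: `n_jobs` only controls parallelism and does not change the fitted model, so the five candidates are effectively identical. A more conventional search space for logistic regression tunes the regularization strength `C`; a sketch on synthetic data (not the airline dataset):

```python
# Sketch (synthetic data): tune the regularization strength C, which actually
# changes the fitted model, rather than n_jobs (which only affects parallelism).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, random_state=0)
params = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
rscv = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                          param_distributions=params,
                          n_iter=5, scoring="f1", random_state=0)
rscv.fit(X, y)
print(rscv.best_params_)
```

With only five candidates, `n_iter=5` exhausts the space, so this behaves like a small grid search.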
model_lr = LogisticRegression(**params)
model_lr, train_lr, accuracy_lr, precision_lr, recall_lr, roc_lr, tt_lr, mu_lr = get_model_metrics(model_lr,
X_train,
X_test,
y_train,
y_test)
Accuracy on Training = 0.8776540165510324
Accuracy on Test = 0.873821397142019 • Precision = 0.8552837573385519
Recall = 0.8508712158084298 • ROC Area under Curve = 0.9450786859429265
Time taken = 1.0026297569274902 seconds • Memory consumed = 711520 Bytes
RandomForestClassifier().get_params()
{'bootstrap': True,
'ccp_alpha': 0.0,
'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
%%time
clf = RandomForestClassifier()
params = { 'max_depth': [5, 10, 15, 20, 25, 30],
           'max_leaf_nodes': [10, 20, 30, 40, 50],
           'min_samples_split': [2, 3, 4, 5]}  # min_samples_split must be >= 2
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'min_samples_split': 5, 'max_leaf_nodes': 40, 'max_depth': 30}
CPU times: user 1min 4s, sys: 1.84 s, total: 1min 6s
Wall time: 1min 6s
model_rf = RandomForestClassifier(**params)
model_rf, train_rf, accuracy_rf, precision_rf, recall_rf, roc_rf, tt_rf, mu_rf = get_model_metrics(model_rf,
X_train,
X_test,
y_train,
y_test)
Accuracy on Training = 0.926418958455237
Accuracy on Test = 0.9277542639232285 • Precision = 0.9168210628961482
Recall = 0.9152146403192836 • ROC Area under Curve = 0.9751138080512387
Time taken = 2.7000980377197266 seconds • Memory consumed = 735700 Bytes
AdaBoostClassifier().get_params()
{'algorithm': 'SAMME.R',
'base_estimator': None,
'learning_rate': 1.0,
'n_estimators': 50,
'random_state': None}
%%time
clf = AdaBoostClassifier()
params = { 'n_estimators': [25, 50, 75, 100, 125, 150],
'learning_rate': [0.2, 0.4, 0.6, 0.8, 1.0]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 125, 'learning_rate': 0.6}
CPU times: user 1min 29s, sys: 3.28 s, total: 1min 32s
Wall time: 1min 32s
model_ada = AdaBoostClassifier(**params)
# Saving output metrics (order matches the function's return values)
model_ada, train_ada, accuracy_ada, precision_ada, recall_ada, roc_ada, tt_ada, mu_ada = get_model_metrics(model_ada,
X_train,
X_test,
y_train,
y_test)
Accuracy on Training = 0.9074960294240575
Accuracy on Test = 0.9061727360348657 • Precision = 0.9004985044865403
Recall = 0.879197897400954 • ROC Area under Curve = 0.962524637370356
Time taken = 4.641080856323242 seconds • Memory consumed = 739332 Bytes
CategoricalNB().get_params()
{'alpha': 1.0, 'class_prior': None, 'fit_prior': True, 'min_categories': None}
%%time
clf = CategoricalNB()
params = { 'alpha': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
'min_categories': [6, 8, 10]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'min_categories': 8, 'alpha': 0.0001}
CPU times: user 1.67 s, sys: 96.6 ms, total: 1.77 s
Wall time: 1.77 s
model_cnb = CategoricalNB(**params)
# Saving output metrics (order matches the function's return values)
model_cnb, train_cnb, accuracy_cnb, precision_cnb, recall_cnb, roc_cnb, tt_cnb, mu_cnb = get_model_metrics(model_cnb,
X_train,
X_test,
y_train,
y_test)
Accuracy on Training = 0.8942572933210733
Accuracy on Test = 0.890122784226627 • Precision = 0.8703649917707426
Recall = 0.8751095103669814 • ROC Area under Curve = 0.9490493379166705
Time taken = 0.09247374534606934 seconds • Memory consumed = 751384 Bytes
XGBClassifier().get_params()
{'objective': 'binary:logistic',
'use_label_encoder': False,
'base_score': None,
'booster': None,
'callbacks': None,
'colsample_bylevel': None,
'colsample_bynode': None,
'colsample_bytree': None,
'early_stopping_rounds': None,
'enable_categorical': False,
'eval_metric': None,
'gamma': None,
'gpu_id': None,
'grow_policy': None,
'importance_type': None,
'interaction_constraints': None,
'learning_rate': None,
'max_bin': None,
'max_cat_to_onehot': None,
'max_delta_step': None,
'max_depth': None,
'max_leaves': None,
'min_child_weight': None,
'missing': nan,
'monotone_constraints': None,
'n_estimators': 100,
'n_jobs': None,
'num_parallel_tree': None,
'predictor': None,
'random_state': None,
'reg_alpha': None,
'reg_lambda': None,
'sampling_method': None,
'scale_pos_weight': None,
'subsample': None,
'tree_method': None,
'validate_parameters': None,
'verbosity': None}
%%time
clf = XGBClassifier()
params = { 'max_depth': [3, 5, 6, 10, 15, 20],
'learning_rate': [0.01, 0.1, 0.2, 0.3],
'n_estimators': [100, 500, 1000]}
rscv = RandomizedSearchCV(estimator = clf,
param_distributions = params,
scoring = 'f1',
n_iter = 10,
verbose = 1)
rscv.fit(X_train, y_train)
rscv.predict(X_test)
# Parameter object to be passed through to function activation
params = rscv.best_params_
print("Best parameters:", params)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 100, 'max_depth': 10, 'learning_rate': 0.1}
CPU times: user 2h 11min 40s, sys: 3min 9s, total: 2h 14min 50s
Wall time: 8min 40s
model_xgb = XGBClassifier(**params)
# Saving output metrics (order matches the function's return values)
model_xgb, train_xgb, accuracy_xgb, precision_xgb, recall_xgb, roc_xgb, tt_xgb, mu_xgb = get_model_metrics(model_xgb,
X_train,
X_test,
y_train,
y_test)
Accuracy on Training = 0.9503886984870016
Accuracy on Test = 0.9438042157314671 • Precision = 0.9466893378675735
Recall = 0.9213472208702423 • ROC Area under Curve = 0.9876708141468592
Time taken = 5.126989126205444 seconds • Memory consumed = 1375012 Bytes
# Collecting model data
training_scores = [train_lr, train_rf, train_ada, train_cnb, train_xgb]
accuracy = [accuracy_lr, accuracy_rf, accuracy_ada, accuracy_cnb, accuracy_xgb]
roc_scores = [roc_lr, roc_rf, roc_ada, roc_cnb, roc_xgb]
precision = [precision_lr, precision_rf, precision_ada, precision_cnb, precision_xgb]
recall = [recall_lr, recall_rf, recall_ada, recall_cnb, recall_xgb]
time_scores = [tt_lr, tt_rf, tt_ada, tt_cnb, tt_xgb]
memory_scores = [mu_lr, mu_rf, mu_ada, mu_cnb, mu_xgb]
model_data = {'Model': ['Logistic Regression', 'Random Forest', 'Adaptive Boost',
'Categorical Bayes', 'Extreme Gradient Boost'],
'Accuracy on Training' : training_scores,
'Accuracy on Test' : accuracy,
'ROC AUC Score' : roc_scores,
'Precision' : precision,
'Recall' : recall,
'Time Elapsed (seconds)' : time_scores,
'Memory Consumed (bytes)': memory_scores}
model_data = pd.DataFrame(model_data)
model_data
| Model | Accuracy on Training | Accuracy on Test | ROC AUC Score | Precision | Recall | Time Elapsed (seconds) | Memory Consumed (bytes) | |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.877654 | 0.873821 | 0.945079 | 0.855284 | 0.850871 | 1.002630 | 711520 |
| 1 | Random Forest | 0.926419 | 0.927754 | 0.975114 | 0.916821 | 0.915215 | 2.700098 | 735700 |
| 2 | Adaptive Boost | 0.907496 | 0.906173 | 0.962525 | 0.900499 | 0.879198 | 4.641081 | 739332 |
| 3 | Categorical Bayes | 0.894257 | 0.890123 | 0.949049 | 0.870365 | 0.875110 | 0.092474 | 751384 |
| 4 | Extreme Gradient Boost | 0.950389 | 0.943804 | 0.987671 | 0.946689 | 0.921347 | 5.126989 | 1375012 |
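With the metrics gathered in one DataFrame, the champion model can also be picked programmatically rather than by eye. A sketch using the ROC AUC column, with the scores copied from the comparison table above:

```python
# Sketch: pick the champion model programmatically from the comparison table.
# Scores are copied from the table above.
import pandas as pd

model_data = pd.DataFrame({
    "Model": ["Logistic Regression", "Random Forest", "Adaptive Boost",
              "Categorical Bayes", "Extreme Gradient Boost"],
    "ROC AUC Score": [0.945079, 0.975114, 0.962525, 0.949049, 0.987671],
})
# idxmax returns the row label of the highest score
champion = model_data.loc[model_data["ROC AUC Score"].idxmax(), "Model"]
print(champion)  # Extreme Gradient Boost
```

The same pattern works for any column (e.g. `Recall`, or `Time Elapsed` with `idxmin`), which is useful when the selection criterion changes per client.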
# Plotting each model's performance scores vs time elapsed
plt.rcParams["figure.figsize"] = (25,15)
ax1 = model_data.plot.bar(x = 'Model', y = ["Accuracy on Training", "Accuracy on Test", "ROC AUC Score",
"Precision", "Recall"],
cmap = 'coolwarm')
ax1.legend()
ax1.set_title("Model Comparison", fontsize = 18)
ax1.set_xlabel('Model', fontsize = 14)
ax1.set_ylabel('Result', fontsize = 14, color = 'Black');
# Plotting each model's memory consumption
ax1 = model_data.plot.bar(x = 'Model', y = 'Memory Consumed (bytes)')
ax1.set_title("Resources consumed per model (bytes)", fontsize = 18)
ax2 = model_data['Time Elapsed (seconds)'].plot(secondary_y = True, color = 'Gold', linewidth = 4, marker = 's')
ax1.set_xlabel('Model', fontsize = 14)
ax2.set_ylabel('Time Elapsed (seconds)', fontsize = 14, color = 'Gold', fontweight = 'bold')
ax1.set_ylabel('Memory Consumed (bytes)', fontsize = 14, color = 'Black');
Observations : At this point, if I were to package a model pipeline to recommend to a client, I would promote the Extreme Gradient Boosting (XGBoost) algorithm. Although it was not the most efficient model in terms of time and memory, it delivered the strongest predictive performance.
Metric: Weight
from xgboost import plot_importance
model_xgb.get_booster().feature_names = ['type_of_travel', 'inflight_wifi_service', 'online_boarding',
'seat_comfort', 'inflight_entertainment', 'on_board_service',
'leg_room_service', 'cleanliness', 'class_business', 'class_eco']
plot_importance(model_xgb)
plt.show()
Observations : In terms of feature weight (the number of times a feature was used to split the data), our model used the seat comfort feature the most to split the data across all trees in our model, followed by online boarding, inflight entertainment, on-board service quality, leg room, in-flight wi-fi, and cleanliness.
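The "weight" metric really is just a split count. The same idea can be reproduced on a single scikit-learn decision tree, where internal nodes store the index of the feature they split on; a sketch on synthetic data (not the XGBoost model above):

```python
# Sketch (synthetic data): the "weight" importance metric is a split count.
# For one sklearn decision tree, tree_.feature holds the split feature index
# at each internal node and a negative sentinel (-2) at each leaf.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep only internal nodes, then count how often each feature is used
split_features = tree.tree_.feature[tree.tree_.feature >= 0]
weight = np.bincount(split_features, minlength=X.shape[1])
print(dict(enumerate(weight)))
```

XGBoost's `plot_importance` aggregates exactly this count over every tree in the boosted ensemble when `importance_type='weight'` (its default).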
import pickle
# Saving test model.
pickle.dump(model_xgb, open('./Models/model_xgb.pkl', 'wb'))
pickled_model = pickle.load(open('./Models/model_xgb.pkl', 'rb'))
X_test
| type_of_travel | inflight_wifi_service | online_boarding | seat_comfort | inflight_entertainment | on_board_service | leg_room_service | cleanliness | class_business | class_eco | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4.0 | 3.0 | 2.0 | 4.0 | 4.0 | 4.0 | 4.0 | 0 | 1 |
| 1 | 1 | 0.0 | 3.0 | 4.0 | 3.0 | 3.0 | 3.0 | 4.0 | 1 | 0 |
| 4 | 1 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0 | 1 |
| 5 | 1 | 2.0 | 4.0 | 2.0 | 4.0 | 3.0 | 2.0 | 4.0 | 0 | 1 |
| 6 | 1 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25971 | 1 | 2.0 | 2.0 | 3.0 | 3.0 | 2.0 | 1.0 | 3.0 | 1 | 0 |
| 25972 | 1 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | 1 | 0 |
| 25973 | 0 | 1.0 | 0.0 | 1.0 | 1.0 | 3.0 | 2.0 | 1.0 | 0 | 1 |
| 25974 | 1 | 2.0 | 3.0 | 3.0 | 3.0 | 2.0 | 1.0 | 3.0 | 1 | 0 |
| 25975 | 0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 | 1 |
23863 rows × 10 columns
pickled_model.predict(X_test)
array([1, 1, 0, ..., 0, 1, 0])
# !pip install shap
import shap
explainer = shap.Explainer(model_xgb, feature_names = features)
# Explain feature importance in producing model output from base value to actual output
shap_values = explainer(X_train)
shap_values.values[0]
array([-3.6781335 , -3.247354 , -1.294865 , 0.16754012, -0.31717607,
0.04861424, -0.12336768, 0.30478886, -0.51065814, 0.04983278],
dtype=float32)
shap.plots.bar(shap_values)
%%time
shap.initjs()
shap.summary_plot(shap_values, X_train, class_names=model_xgb.classes_)
CPU times: user 10.3 s, sys: 154 ms, total: 10.4 s
Wall time: 10.4 s
Observations : Considering the mean SHAP value as our metric for feature importance, we can observe that in-flight wi-fi service is the most impactful feature of our data, followed closely by travel type and online boarding.
Looking at the SHAP summary plot (or beeswarm), which illustrates the feature value of each individual observation in our data and its impact on the model output, we get a more complete picture of the overall feature impact on our model. The features are again ordered by their impact on prediction, but the plot also shows how higher and lower values of each feature affect the result.
For almost every feature, high feature values have a positive impact on prediction while low feature values have a negative impact on prediction.
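The core intuition behind SHAP values, per-feature contributions that sum from a base value to the model's actual output, can be seen exactly in the linear case, where each contribution is simply coef × (x − feature mean). A sketch on synthetic data (this is the linear special case, not the SHAP algorithm itself):

```python
# Sketch (synthetic data): for a linear model, SHAP-style contributions are
# exactly coef * (x - feature mean), and they sum from the base value
# (the mean prediction) to the model's output for each sample.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
base_value = model.predict(X).mean()
contributions = model.coef_ * (X - X.mean(axis=0))

# base value + per-feature contributions reconstructs every prediction
reconstructed = base_value + contributions.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X)))  # True
```

For tree ensembles like our XGBoost model the contributions are computed differently (via TreeSHAP), but the same additivity property holds, which is what makes the bar and beeswarm plots interpretable.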
inflight_wifi_service feature impact
shap.plots.scatter(shap_values[:, "inflight_wifi_service"], color=shap_values)
Observations : I wanted to take a closer look into the in-flight wi-fi feature. One, because it is the feature that had the most impact in determining model output; but also particularly because it was a feature that drew my attention during exploratory analysis, seeing as how certain values had high impact driving satisfaction and dissatisfaction among different travel classes and travel types.
To refresh: for travel type, I encoded personal travel as 0 and business travel as 1. The way the model uses this feature almost directly mirrors the data itself: high ratings from leisure travelers have more impact in determining satisfaction, while low ratings have more impact on dissatisfaction.
It can also be observed that some business travelers are satisfied (positive SHAP values) regardless of how they rated the wi-fi service. However, the business travelers with negative SHAP values, that is, the dissatisfied ones, gave low marks to their experience with wi-fi on board.
online_boarding feature impact
shap.plots.scatter(shap_values[:, "online_boarding"], color=shap_values)
Observations : I also wanted to take a quick look at the SHAP impact of online boarding. This is another feature which showed high impact potential, regardless of travel reason or class.
Similar to what we observed in exploratory analysis, low marks for the boarding process, regardless of travel reason, had a negative impact on the model output, pointing toward passenger dissatisfaction, whereas high marks were consistent with customers who were more or less satisfied with their overall travel experience.
At this point I would be able to confidently present to the client that we have data-informed recommendations on where they can improve their customer service, evidenced by an interpretable XGBoost machine learning pipeline that exhibits 94.4% test accuracy and 94.7% precision.
To summarize, it is crucial to focus on how efficiently passengers get on the plane and how they are treated while on board.
There is a fundamental difference in how customers react to certain service qualities given their reason for travel. Most important among these service aspects is the quality of in-flight wi-fi service.
Regardless of cabin or travel purpose, high marks for wi-fi capability are a huge opportunity for increasing customer satisfaction and expanding competitive advantage. Increasing availability, for example by making wi-fi complimentary in all cabins, would draw in customers who want ready access to video and music streaming, while improving capability would attract business travelers who know they can stay productive in the air.
The efficiency of getting people on board the plane is also something to improve upon. This isn't specific to one cabin class or purpose of travel; it's universal. Passengers who gave higher marks for the boarding process reported being satisfied with their travel experience, while passengers who gave mediocre or low marks did not.
Secondary to those features, there's also opportunity to grow in terms of the quality of in-flight entertainment, food and beverage, seat comfort, cabin cleanliness, and leg room.
I think it's important that the airline focus on these key service aspects to improve customer satisfaction and retention, and to establish or expand its competitive advantage over other carriers.
Chris Bacani
Data Scientist Apprentice, IBM Corporation
2022